{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e096bf13",
   "metadata": {},
   "source": [
    "![transformer_architecture](images/transformer_architecture.jpeg)\n",
    "\n",
    "1. 源自encoder-decoder，在其他类型任务时，干掉decoder，只使用encoder；\n",
    "2. 和RNN相比，最大的变化是并行化，为了并行化引入了：位置嵌入 (Positional Encoding) \n",
    "3. self-attention,自注意力机制\n",
    "4. \n",
    "4. transformer\n",
    "5. BERT\n",
    "6. 基于BERT的各种nlp任务(文本分类，Qu相关性，中文分词，实体识别)\n",
    "\n",
    "* https://wmathor.com/index.php/archives/1438/\n",
    "* https://baijiahao.baidu.com/s?id=1651219987457222196&wfr=spider&for=pc\n",
    "* https://zhuanlan.zhihu.com/p/59629215\n",
    "* https://github.com/kuro7766/nlp_notebooks/blob/79866eab826e2ed15e41ef1ed2831415619e3c60/5-1.Transformer/Transformer.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2a1b762f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch import nn, Tensor\n",
    "\n",
    "# 1. Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b05ac7bb",
   "metadata": {},
   "source": [
    "2. Positional Encoding\n",
    "\n",
    "位置嵌入的维度为 [max_sequence_length, embedding_dimension], 和词向量的维度是相同的，因此，词嵌入和位置嵌入值可以相加。\n",
    "\n",
    "位置嵌入参数是否可调整？\n",
    "\n",
    "位置嵌入的生成，随着 embedding_dimension 序号增大，位置嵌入函数的周期变化越来越平缓\n",
    "\n",
    "https://wmathor.com/index.php/archives/1453/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "52e8a8c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "class PositionalEncoding(nn.Module):\n",
    "    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):\n",
    "        super().__init__()\n",
    "        self.dropout = nn.Dropout(p=dropout)\n",
    "\n",
    "        position = torch.arange(max_len).unsqueeze(1)\n",
    "        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))\n",
    "        pe = torch.zeros(max_len, 1, d_model)\n",
    "        pe[:, 0, 0::2] = torch.sin(position * div_term)\n",
    "        pe[:, 0, 1::2] = torch.cos(position * div_term)\n",
    "        self.register_buffer('pe', pe)\n",
    "\n",
    "    def forward(self, x: Tensor) -> Tensor:\n",
    "        \"\"\"\n",
    "        Args:\n",
    "            x: Tensor, shape [seq_len, batch_size, embedding_dim]\n",
    "        \"\"\"\n",
    "        x = x + self.pe[:x.size(0)]\n",
    "        return self.dropout(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5036b869",
   "metadata": {},
   "source": [
    "3. self-attention\n",
    "\n",
    "参考这个：https://z3.ax1x.com/2021/04/20/c7wF9x.png#shadow\n",
    "\n",
    "![QKV](https://z3.ax1x.com/2021/04/20/c7wF9x.png#shadow)\n",
    "定义三个矩阵Wq,Wk,Wv，维度=embedding_dimension，大小vocab_size. 类比CNN里的卷积核？\n",
    "\n",
    "![QKV_compute](https://z3.ax1x.com/2021/04/20/c7wk36.png#shadow)\n",
    "查询向量*键矩阵，再softmax，再乘以值向量，再求和。\n",
    "\n",
    "Q和K的点乘表示Q和K元素之间(每个元素都是向量)的相似程度.\n",
    "\n",
    "为啥要除以dk，假设 Q 和 K 的均值为0，方差为1。它们的矩阵乘积将有均值为0，方差为dk，因此使用dk的平方根被用于缩放，因为，Q 和 K 的矩阵乘积的均值本应该为 0，方差本应该为1，这样可以获得更平缓的softmax。当维度很大时，点积结果会很大，会导致softmax的梯度很小。为了减轻这个影响，对点积进行缩放。\n",
    "\n",
    "序列中的每个单词(token)和该序列中其余单词(token)进行 attention 计算\n",
    "\n",
    "对应padding部分，填充0，softmax时需要计算，给无效区域加一个很大的负数偏置，经过softmax后值为0。\n",
    "https://wmathor.com/index.php/archives/1557/\n",
    "https://wmathor.com/index.php/archives/1540/\n",
    "\n",
    "self-attention的可解释性：\n",
    "透過計算Query和各個Key的相似性，得到每個Key對應Value的權重係數，權重係數代表訊息的重要性，亦即attention score；\n",
    "Value則是對應的訊息，再對Value進行加權求和，得到最終的Attention/context vector。\n",
    "https://github.com/nadavbh12/Character-Level-Language-Modeling-with-Deeper-Self-Attention-pytorch/blob/master/modules/attention.py\n",
    "\n",
    "3.1 Multi-Head Attention\n",
    "不同的关注点？\n",
    "多头或者堆叠，Wq,Wk,Wv矩阵是共享的吗？\n",
    "\n",
    "\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html?highlight=attention#torch.nn.MultiheadAttention\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.TransformerDecoder.html?highlight=transformer#torch.nn.TransformerDecoder\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.TransformerEncoderLayer.html?highlight=attention\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.TransformerEncoder.html?highlight=transformer#torch.nn.TransformerEncoder\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.TransformerDecoderLayer.html?highlight=attention\n",
    "* https://pytorch.org/docs/master/generated/torch.nn.Transformer.html?highlight=transformer#torch.nn.Transformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8b17c911",
   "metadata": {},
   "outputs": [],
   "source": [
    "#padding部分，有效区=0，无效区=-inf(很大的负数偏置)\n",
    "def generate_square_subsequent_mask(self, sz: int) -> Tensor:\n",
    "    r\"\"\"Generate a square mask for the sequence. The masked positions are filled with float('-inf').\n",
    "        Unmasked positions are filled with float(0.0).\n",
    "    \"\"\"\n",
    "    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)\n",
    "    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))\n",
    "    return mask"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66bbc144",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. Add & Norm\n",
    "# 词嵌入 + multi-head attention, 残差连接\n",
    "# Layer Normalization， 和batch Normal不一样，按列求均值，再求方差.  训练和预测时有何不同？\n",
    "# 每个 Encoder Block 中的 FeedForward 层权重都是共享的？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9b09795",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. Feed forward\n",
    "# linear(relu(linear(x)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d74b9ece",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Masked Self-Attention\n",
    "# Mask 首先生成一个下三角全 0，上三角全为负无穷的矩阵，然后将其与 Scaled Scores 相加，\n",
    "# 再做 softmax，就能将 - inf 变为 0，得到的这个矩阵即为每个字之间的权重"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4839d11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 堆叠的意义？\n",
    "# 特定层是有独特的功能的，底层更偏向于关注语法，顶层更偏向于关注语义"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d09a547b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Masked Encoder-Decoder Attention\n",
    "# 和前面 Masked Self-Attention 很相似，结构也一摸一样，唯一不同的是这里的  \n",
    "# K,V 为 Encoder 的输出， Q 为 Decoder 中 Masked Self-Attention 的输出"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
